Libraries used

Data processing

All usage event (Actual and Expected gallons) data across all zones is padded to 1-second interval data, assuming constant flowrates. Within a water source, Actual and Expected Gallons are then aggregated to the minute across all Zones. THe meter data is similarly aggregated to minute-level data. Then, minute-level usage events can be matched to minute-level meter data.

Minute-Level Analysis

For each minute for each water source, the difference between Actual (metered) and Expected gallons is computed by simple subtraction. The Actual and Expected Gallons time series plots below are interactive via zoom and pan. Between March and October 2020, there were 31,574 minutes with scheduled usage in Water Source 376, and 77,907 minutes in Water Source 377. With a defined threshold of \(Actual - Expected = 1gpm\) ,

There are 22,476 usage events (6,755 in Water Source 376, 15,721 in Water Source 377). These events cover 109,481 minutes, but 189,544 minute-ZoneId pairs. If every minute where metered usage is more than 1 gallon over the expected usage is flagged, 46% of minutes would be flagged. Increasing this threshold to 5 gallons reduces the flag percentage to 33%. Minute-level flags are impractical and misleading due to simple noise and uncertainties around whether assuming a constant flowrate throughout a given usage event is reasonable.

To what timescale should usage events and/or meter readings be aggregated to to meaningfully compare metered (actual) and expected usage? Usage events are not suitable because usage events within the same Water Source often temporally overlap due to being associated with different zones. The next aggregation level available appears to be the “Summary Level Id”, which identifies clusters of usage events that do not overlap others. Summary Level Ids cover usage events across multiple water sources, so the unit of analysis chosen is the Summary Level - Water Source pair. There are 198 such Summary Levels for Water Source 376, and 300 for Water Source 377. These range in duration from 1 to ~800 minutes. Below is a time series chart showing the Expected and Metered GPM in each Water Source Id by Summary Level.

Evidently, most Summary Levels have lower average metered GPM than expected GPM. Tabulating below at several GPM thresholds:

WaterSourceId total flags_1gpm flags_2gpm flags_3gpm flags_4gpm flags_5gpm flags_10gpm
Water Source 376 198 19 16 7 6 5 2
Water Source 377 300 26 21 19 14 12 6

Below is table of all Summary Level - Water Source pairs and the difference between metered GPM and expected GPM over the course of each pair.

Below is table of all Summary Level - Water Source pairs and the Zones they correspond to.

slz <- a %>% select(WaterSourceId, SummaryLevelId, ZoneId) 
slz$WaterSourceId <- paste0("Water Source ",slz$WaterSourceId)

slz <- left_join(slz,sl,by=c("WaterSourceId","SummaryLevelId")) %>% select(WaterSourceId,SummaryLevelId,ZoneId,DiffMeteredGPM,StartTime,EndTime)

slz <- slz %>% group_by(WaterSourceId,SummaryLevelId,ZoneId) %>% summarize(DiffMeteredGPM=mean(DiffMeteredGPM), StartTime = min(StartTime), EndTime= min(EndTime))
## `summarise()` regrouping output by 'WaterSourceId', 'SummaryLevelId' (override with `.groups` argument)
datatable(slz, extensions = c('Buttons','Scroller'), filter = list(
  position = 'top', clear = FALSE
), options = list(
    dom = 'Bfrtip',
    paging = TRUE,
    buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
    deferRender = TRUE,
  scrollY = 200,
  scroller = TRUE)) %>% formatRound(columns=c('DiffMeteredGPM'), digits=3)